 rl model


Accelerating Reinforcement Learning Training Using Simulation Surrogate Models

arXiv.org Machine Learning

High-fidelity simulation models are widely used to analyze complex stochastic systems, but their high computational cost motivates the development of cheaper surrogate models that approximate the simulation model's input-output relationship. In parallel, reinforcement learning (RL) has emerged as a powerful framework for making online decisions in stochastic environments, with increasing attention being given to the use of simulation models as training environments for RL models. We investigate a class of surrogate models suitable for accelerating RL training in settings where the reward structure, model parameters, or system dynamics change over time and explore their interactions with simulation models and RL models. Through numerical experiments on a stochastic service system modeled via discrete-event simulation, we demonstrate that leveraging surrogate models can substantially accelerate RL training and re-training.


Fitting Reinforcement Learning Model to Behavioral Data under Bandits

arXiv.org Artificial Intelligence

We consider the problem of fitting a reinforcement learning (RL) model to some given behavioral data under a multi-armed bandit environment. These models have received much attention in recent years for characterizing human and animal decision making behavior. We provide a generic mathematical optimization problem formulation for the fitting problem of a wide range of RL models that appear frequently in scientific research applications, followed by a detailed theoretical analysis of its convexity properties. Based on the theoretical results, we introduce a novel solution method for the fitting problem of RL models based on convex relaxation and optimization. Our method is then evaluated in several simulated bandit environments to compare with some benchmark methods that appear in the literature. Numerical results indicate that our method achieves comparable performance to the state-of-the-art, while significantly reducing computation time. We also provide an open-source Python package for our proposed method to empower researchers to apply it in the analysis of their datasets directly, without prior knowledge of convex optimization.
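
The abstract does not say which RL model or objective the package uses, so the following is only a minimal sketch of one common instance of the fitting problem: a delta-rule Q-learning agent with a softmax choice rule, scored by the negative log-likelihood of the observed choices. The function name `q_learning_nll` and the parameters `alpha` (learning rate) and `beta` (inverse temperature) are illustrative assumptions, and the paper's convex relaxation is not reproduced here.

```python
import math

def q_learning_nll(actions, rewards, alpha, beta, n_arms):
    """Negative log-likelihood of observed bandit choices under a
    delta-rule Q-learning model with a softmax policy (illustrative)."""
    q = [0.0] * n_arms          # action values, initialised to zero (assumption)
    nll = 0.0
    for a, r in zip(actions, rewards):
        # Softmax log-probability of the chosen arm, computed stably.
        logits = [beta * v for v in q]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        nll -= logits[a] - log_z
        # Delta-rule update of the chosen arm only.
        q[a] += alpha * (r - q[a])
    return nll

# Example: score one candidate parameter setting on a short choice sequence.
score = q_learning_nll([0, 0, 1, 0], [1, 1, 0, 1], alpha=0.5, beta=2.0, n_arms=2)
```

Fitting then amounts to minimising this quantity over `alpha` and `beta`; a generic optimiser or grid search would do for a sketch like this, whereas the paper proposes a convex-relaxation-based method instead.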


Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning

arXiv.org Artificial Intelligence

Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) enhances overall accuracy (pass@1) but often fails to improve capability (pass@k) of LLMs in reasoning tasks, while distillation can improve both. In this paper, we investigate the mechanisms behind these phenomena. First, we demonstrate that RLVR struggles to improve capability as it focuses on improving the accuracy of the easier questions to the detriment of the accuracy of the most difficult questions. Second, we show that RLVR does not merely increase the success probability for the easier questions, but in our small model settings, produces quality responses that were absent in its original output distribution. In addition, we show these responses are neither noticeably longer nor feature more reflection-related keywords, underscoring the need for more reliable indicators of response quality. Third, from the experiment distilling teacher responses to in-distribution problems, we find that capability does not always improve with distillation. We conjecture that capability improves only when new knowledge is introduced, whereas distilling reasoning patterns only improves accuracy but not capability, sacrificing performance on the most difficult questions, similar to RLVR. Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in LLMs.
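
To make the accuracy (pass@1) versus capability (pass@k) distinction concrete, here is a minimal sketch of the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled responses of which c are correct. The abstract does not state how the paper computes pass@k, so the estimator choice, the function name `pass_at_k`, and the toy counts below are assumptions for illustration only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples per question, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n samples contains no correct response.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over questions, given correct counts per question."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Toy counts (hypothetical): one easy and one hard question, 10 samples each.
model_a = [9, 0]   # strong on the easy question, never solves the hard one
model_b = [5, 1]   # weaker on the easy question, occasionally solves the hard one

a1, a10 = mean_pass_at_k(model_a, 10, 1), mean_pass_at_k(model_a, 10, 10)
b1, b10 = mean_pass_at_k(model_b, 10, 1), mean_pass_at_k(model_b, 10, 10)
```

In this toy setting, model A has the higher pass@1 while model B has the higher pass@10, which mirrors the pattern the abstract describes: concentrating probability on easier questions can raise accuracy without raising capability.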


Federated Deep Reinforcement Learning for Privacy-Preserving Robotic-Assisted Surgery

arXiv.org Artificial Intelligence

The integration of Reinforcement Learning (RL) into robotic-assisted surgery (RAS) holds significant promise for advancing surgical precision, adaptability, and autonomous decision-making. However, the development of robust RL models in clinical settings is hindered by key challenges, including stringent patient data privacy regulations, limited access to diverse surgical datasets, and high procedural variability. To address these limitations, this paper presents a Federated Deep Reinforcement Learning (FDRL) framework that enables decentralized training of RL models across multiple healthcare institutions without exposing sensitive patient information. A central innovation of the proposed framework is its dynamic policy adaptation mechanism, which allows surgical robots to select and tailor patient-specific policies in real-time, thereby ensuring personalized and optimised interventions. To uphold rigorous privacy standards while facilitating collaborative learning, the FDRL framework incorporates secure aggregation, differential privacy, and homomorphic encryption techniques. Experimental results demonstrate a 60% reduction in privacy leakage compared to conventional methods, with surgical precision maintained within a 1.5% margin of a centralized baseline. This work establishes a foundational approach for adaptive, secure, and patient-centric AI-driven surgical robotics, offering a pathway toward clinical translation and scalable deployment across diverse healthcare environments.


Combining Reinforcement Learning and Behavior Trees for NPCs in Video Games with AMD Schola

arXiv.org Artificial Intelligence

For example, a recent study [1] concludes that NPCs based on behavior trees (BTs) are still more viable than those based on machine learning (ML), calling for new approaches, strategies, and tooling to overcome the barrier to adoption. Additional work has also underscored the need for reusable and adjustable models [2], motivated by game developers' preferences to reuse previously developed assets, provided that reuse does not result in repetitive gameplay. Traditional BT approaches and modern RL techniques each have their respective strengths and limitations in video game development. BTs offer a structured and hierarchical method for managing NPC behaviors, enabling the design of complex systems with predictable outcomes given sufficient development time. However, this complexity can make multi-task BTs less engaging and cumbersome to develop [2]. Conversely, RL provides a dynamic and adaptive approach to decision making [3], allowing developers to guide an agent through trial-and-error. However, training generally-capable RL models remains a challenge, particularly due to reward shaping, negative task transfer [4, 5], and compute resource demands [6].


From Supervision to Exploration: What Does Protein Language Model Learn During Reinforcement Learning?

arXiv.org Artificial Intelligence

Protein Language Models (PLMs) have achieved significant breakthroughs in computational protein science through pre-training on large-scale sequence databases and leveraging scalable network architectures. Concurrently, Reinforcement Learning (RL) has demonstrated substantial progress across multiple protein design tasks by enabling expanded exploration capabilities and precise multi-objective optimization. While RL has shown transformative potential in natural language processing by enabling models to discover emergent capabilities beyond their training distributions, its capacity to unlock latent functional patterns within protein sequence space remains underexplored. In this study, we investigate whether RL-enhanced PLMs can transcend their pre-training limitations and identify implicit sequence-structure-function relationships not explicitly encoded in foundational datasets. Through systematic evaluation across four critical protein design domains--antimicrobial peptide (AMP) design, kinase optimization, antibody engineering, and inverse folding--we employ diverse RL algorithms and model architectures to address this fundamental question. Our comprehensive analysis demonstrates that RL reliably improves sampling efficiency across domains and, more importantly, that its effectiveness is governed by a three-factor interaction: task difficulty, reward model accuracy, and policy capacity. Gains scale when rewards are accurate and informative, policies have sufficient capacity to realize the signal, and tasks present headroom beyond supervised learning; conversely, noisy rewards or capacity bottlenecks cap improvements despite exploration. This principled view offers practical guidance for RL in protein design: prioritize reward refinement before scaling policy size, match RL algorithms and regularization strength to task difficulty, and allocate capacity where marginal gains are largest. Implementation is available at github. 
These advances have successfully propelled the development of sequence-function relationship studies and protein design applications (Qiu et al., 2024; Zhang et al., 2025; Ruffolo et al., 2025). Task difficulty equates to mountain height, policy model capacity to the starting altitude, and reward accuracy to direction correctness.


A Comparative Analysis of Reinforcement Learning and Conventional Deep Learning Approaches for Bearing Fault Diagnosis

arXiv.org Artificial Intelligence

Bearing faults in rotating machinery can lead to significant operational disruptions and maintenance costs. Modern methods for bearing fault diagnosis rely heavily on vibration analysis and machine learning techniques, which often require extensive labeled data and may not adapt well to dynamic environments. This study explores the feasibility of reinforcement learning (RL), specifically Deep Q-Networks (DQNs), for bearing fault classification tasks in machine condition monitoring to enhance the accuracy and adaptability of bearing fault diagnosis. The results demonstrate that while RL models developed in this study can match the performance of traditional supervised learning models under controlled conditions, they excel in adaptability when equipped with optimized reward structures. However, their computational demands highlight areas for further improvement. These findings demonstrate RL's potential to complement traditional methods, paving the way for adaptive diagnostic frameworks.


Semi-on-Demand Transit Feeders with Shared Autonomous Vehicles and Reinforcement-Learning-Based Zonal Dispatching Control

arXiv.org Artificial Intelligence

This paper develops a semi-on-demand transit feeder service using shared autonomous vehicles (SAVs) and zonal dispatching control based on reinforcement learning (RL). This service combines the cost-effectiveness of fixed-route transit with the adaptability of demand-responsive transport to improve accessibility in lower-density areas. Departing from the terminus, SAVs first make scheduled fixed stops, then offer on-demand pick-ups and drop-offs in a pre-determined flexible-route area. Our deep RL model dynamically assigns vehicles to subdivided flexible-route zones in response to real-time demand fluctuations and operations, using a policy gradient algorithm, Proximal Policy Optimization (PPO). The methodology is demonstrated through agent-based simulations on a real-world bus route in Munich, Germany. Results show that after efficient training of the RL model, the semi-on-demand service with dynamic zonal control serves 16% more passengers at 13% higher generalized costs on average compared to traditional fixed-route service. The efficiency gain from RL control amounts to 2.4% more passengers at 1.4% higher costs. This study not only showcases the potential of integrating SAV feeders and machine learning techniques into public transit, but also sets the groundwork for further innovations in addressing first-mile-last-mile problems in multimodal transit systems.
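
For readers unfamiliar with Proximal Policy Optimization, its usual clipped surrogate objective is reproduced below as background. The abstract does not say which PPO variant or hyperparameters the authors use, so this is the standard textbook form rather than the paper's exact training objective; here $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping range.

```latex
\[
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
                  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\]
```

Clipping the probability ratio $r_t(\theta)$ removes the incentive to move the new policy far from the old one in a single update, which is the usual argument for PPO's stability in settings such as simulation-based dispatching control.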


Link Prediction for Event Logs in the Process Industry

arXiv.org Artificial Intelligence

Knowledge management (KM) is vital in the process industry for optimizing operations, ensuring safety, and enabling continuous improvement through effective use of operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records, e.g., entries documenting issues related to equipment or processes and the corresponding solutions, may remain disconnected. This fragmentation hinders the recommendation of previous solutions to the users. To address this problem, we investigate record linking (RL) as link prediction, commonly studied in graph-based machine learning, by framing it as a cross-document coreference resolution (CDCR) task enhanced with natural language inference (NLI) and semantic text similarity (STS) by shifting it into causal inference (CI). We adapt CDCR, traditionally applied in the news domain, into an RL model that operates at the passage level, similar to NLI and STS, while accommodating the process industry's specific text formats, which contain unstructured text and structured record attributes. Our RL model outperformed the best versions of NLI- and STS-driven baselines by 28% (11.43 points) and 27% (11.21 points), respectively. Our work demonstrates how domain adaptation of the state-of-the-art CDCR models, enhanced with reasoning capabilities, can be effectively tailored to the process industry, improving data quality and connectivity in shift logs.


AutoIndexer: A Reinforcement Learning-Enhanced Index Advisor Towards Scaling Workloads

arXiv.org Artificial Intelligence

Efficiently selecting indexes is fundamental to database performance optimization, particularly for systems handling large-scale analytical workloads. While deep reinforcement learning (DRL) has shown promise in automating index selection through its ability to learn from experience, few works address how these RL-based index advisors can adapt to scaling workloads due to exponentially growing action spaces and heavy trial and error. To address these challenges, we introduce AutoIndexer, a framework that combines workload compression, query optimization, and specialized RL models to scale index selection effectively. By operating on compressed workloads, AutoIndexer substantially lowers search complexity without sacrificing much index quality. Extensive evaluations show that it reduces end-to-end query execution time by up to 95% versus non-indexed baselines. On average, it outperforms state-of-the-art RL-based index advisors by approximately 20% in workload cost savings while cutting tuning time by over 50%. These results affirm AutoIndexer's practicality for large and diverse workloads.